19 research outputs found
PKU-GoodsAD: A Supermarket Goods Dataset for Unsupervised Anomaly Detection and Segmentation
Visual anomaly detection is essential to many tasks in the field of computer
vision. Recent anomaly detection datasets mainly focus on automated industrial
inspection, medical image analysis, and video surveillance.
In order to broaden the application and research of anomaly detection in
unmanned supermarkets and smart manufacturing, we introduce the supermarket
goods anomaly detection (GoodsAD) dataset. It contains 6124 high-resolution
images of 484 goods with different appearances, divided into 6 categories. Each
category contains several common types of anomalies, such as deformation,
surface damage, and opened packaging, covering both texture and structural
changes. The dataset follows the unsupervised setting: only normal
(defect-free) images are used for training, and pixel-precise ground-truth regions
are provided for all anomalies. Moreover, we also conduct a thorough evaluation
of current state-of-the-art unsupervised anomaly detection methods. This
initial benchmark indicates that some methods that perform well on industrial
anomaly detection datasets (e.g., MVTec AD) perform poorly on ours. GoodsAD is
a comprehensive, multi-object dataset for supermarket goods anomaly detection
that focuses on real-world applications.
Comment: 8 pages, 6 figures
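To make the protocol concrete, here is a minimal sketch of the unsupervised setting the dataset follows: a statistical model is fit on normal (defect-free) images only, and test images are scored by their deviation. The patch-statistics features, image sizes, and random stand-in data are illustrative simplifications; typical baselines use pretrained CNN features instead.

    import numpy as np

    def patch_features(img, patch=32):
        """Mean and standard deviation of each non-overlapping patch."""
        h, w, _ = img.shape
        feats = []
        for y in range(0, h - patch + 1, patch):
            for x in range(0, w - patch + 1, patch):
                p = img[y:y + patch, x:x + patch]
                feats.append([p.mean(), p.std()])
        return np.asarray(feats)

    # Training uses normal (defect-free) images only -- the unsupervised setting.
    train_imgs = [np.random.rand(256, 256, 3) for _ in range(16)]  # stand-in data
    train_feats = np.concatenate([patch_features(im) for im in train_imgs])
    mu = train_feats.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(train_feats, rowvar=False) + 1e-6 * np.eye(2))

    def anomaly_scores(img):
        """Mahalanobis distance of each patch to the 'normal' distribution."""
        d = patch_features(img) - mu
        return np.sqrt(np.einsum('ij,jk,ik->i', d, cov_inv, d))

    scores = anomaly_scores(np.random.rand(256, 256, 3))
    print(scores.max())  # image-level anomaly score = worst patch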
GATOR: Graph-Aware Transformer with Motion-Disentangled Regression for Human Mesh Recovery from a 2D Pose
3D human mesh recovery from a 2D pose plays an important role in various
applications. However, it is hard for existing methods to simultaneously
capture the multiple relations during the evolution from skeleton to mesh,
including joint-joint, joint-vertex and vertex-vertex relations, which often
leads to implausible results. To address this issue, we propose a novel
solution, called GATOR, that contains an encoder of Graph-Aware Transformer
(GAT) and a decoder with Motion-Disentangled Regression (MDR) to explore these
multiple relations. Specifically, GAT combines a GCN and a graph-aware
self-attention in parallel to capture physical and hidden joint-joint
relations. Furthermore, MDR models joint-vertex and vertex-vertex interactions
to explore joint and vertex relations. Based on the clustering characteristics
of vertex offset fields, MDR regresses the vertices by composing the predicted
base motions. Extensive experiments show that GATOR achieves state-of-the-art
performance on two challenging benchmarks.
Comment: Accepted by ICASSP 2023
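A minimal sketch of the parallel GCN-plus-attention idea behind the GAT encoder follows; the dimensions, head count, and placeholder adjacency are assumptions for illustration, not the authors' exact implementation.

    import torch
    import torch.nn as nn

    class ParallelGCNAttention(nn.Module):
        """A GCN over the skeleton adjacency (physical joint-joint relations)
        runs in parallel with self-attention (hidden relations); the two
        outputs are summed into the residual stream."""
        def __init__(self, dim, adj):
            super().__init__()
            self.register_buffer('adj', adj)            # (J, J) normalized adjacency
            self.gcn_proj = nn.Linear(dim, dim)
            self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, x):                            # x: (B, J, dim)
            gcn_out = self.adj @ self.gcn_proj(x)        # graph convolution step
            attn_out, _ = self.attn(x, x, x)             # joint-to-joint attention
            return self.norm(x + gcn_out + attn_out)

    num_joints, dim = 17, 64
    adj = torch.eye(num_joints)                          # placeholder adjacency
    block = ParallelGCNAttention(dim, adj)
    print(block(torch.randn(2, num_joints, dim)).shape)  # torch.Size([2, 17, 64])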
Edge-guided Representation Learning for Underwater Object Detection
Underwater object detection (UOD) is crucial for marine economic development,
environmental protection, and the planet's sustainable development. The main
challenges of this task arise from low contrast, small objects, and the mimicry
of aquatic organisms. The key to addressing these challenges is to focus the
model on obtaining more discriminative information. We observe that the edges
of underwater objects are highly distinctive, so objects can be separated from
low-contrast or mimicry backgrounds by their edges. Motivated by this observation, we
propose an Edge-guided Representation Learning Network, termed ERL-Net, that
aims to achieve discriminative representation learning and aggregation under
the guidance of edge cues. Firstly, we introduce an edge-guided attention
module to model the explicit boundary information, which generates more
discriminative features. Secondly, a feature aggregation module is proposed to
aggregate the multi-scale discriminative features by regrouping them into three
levels, effectively aggregating global and local information for locating and
recognizing underwater objects. Finally, we propose a wide and asymmetric
receptive field block to widen the features' receptive field, allowing the
model to capture more information about small objects. Comprehensive
experiments on three challenging underwater datasets show that our method
achieves superior performance on the UOD task.
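As a rough illustration of edge-guided attention, the sketch below uses a fixed Sobel operator to extract boundary cues and a learned gate to turn them into spatial attention weights. It is a stand-in for ERL-Net's module under those assumptions, not the paper's actual design.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EdgeGuidedAttention(nn.Module):
        """Boundary cues from a fixed Sobel operator are gated into a
        spatial attention map that re-weights the input features."""
        def __init__(self):
            super().__init__()
            sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
            kernel = torch.stack([sobel_x, sobel_x.t()]).unsqueeze(1)  # (2, 1, 3, 3)
            self.register_buffer('kernel', kernel)
            self.gate = nn.Conv2d(2, 1, kernel_size=1)   # learned edge-to-attention map

        def forward(self, x):                            # x: (B, C, H, W)
            gray = x.mean(dim=1, keepdim=True)           # collapse channels
            edges = F.conv2d(gray, self.kernel, padding=1)  # horizontal + vertical
            attn = torch.sigmoid(self.gate(edges))       # (B, 1, H, W) weights
            return x + x * attn                          # emphasize edge regions

    feats = torch.randn(2, 64, 32, 32)
    print(EdgeGuidedAttention()(feats).shape)            # torch.Size([2, 64, 32, 32])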
Co-Evolution of Pose and Mesh for 3D Human Body Estimation from Video
Despite significant progress in single image-based 3D human mesh recovery,
accurately and smoothly recovering 3D human motion from a video remains
challenging. Existing video-based methods generally recover human mesh by
estimating the complex pose and shape parameters from coupled image features,
whose high complexity and low representation ability often result in
inconsistent pose motion and limited shape patterns. To alleviate this issue,
we introduce 3D pose as the intermediary and propose a Pose and Mesh
Co-Evolution network (PMCE) that decouples this task into two parts: 1)
video-based 3D human pose estimation and 2) mesh vertices regression from the
estimated 3D pose and temporal image feature. Specifically, we propose a
two-stream encoder that estimates mid-frame 3D pose and extracts a temporal
image feature from the input image sequence. In addition, we design a
co-evolution decoder that performs pose and mesh interactions with the
image-guided Adaptive Layer Normalization (AdaLN) to make pose and mesh fit the
human body shape. Extensive experiments demonstrate that the proposed PMCE
outperforms previous state-of-the-art methods in terms of both per-frame
accuracy and temporal consistency on three benchmark datasets: 3DPW, Human3.6M,
and MPI-INF-3DHP. Our code is available at https://github.com/kasvii/PMCE.
Comment: Accepted by ICCV 2023. Project page: https://kasvii.github.io/PMCE
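Adaptive Layer Normalization itself has a simple, well-known form: the affine parameters of LayerNorm are predicted from a conditioning feature. The sketch below shows that pattern; the names and dimensions are illustrative, not PMCE's exact implementation.

    import torch
    import torch.nn as nn

    class AdaLN(nn.Module):
        """LayerNorm whose scale and shift are predicted from a conditioning
        image feature, so the normalization adapts to each sample."""
        def __init__(self, dim, cond_dim):
            super().__init__()
            self.norm = nn.LayerNorm(dim, elementwise_affine=False)
            self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

        def forward(self, x, cond):        # x: (B, N, dim), cond: (B, cond_dim)
            scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
            return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

    x, cond = torch.randn(2, 24, 64), torch.randn(2, 128)
    print(AdaLN(64, 128)(x, cond).shape)   # torch.Size([2, 24, 64])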
A compact representation of human actions by sliding coordinate coding
Human action recognition remains challenging in realistic videos, where scale and viewpoint changes complicate the problem. Many complex models have been developed to overcome these difficulties, while we explore using low-level features and typical classifiers to achieve state-of-the-art performance. The baseline model of feature encoding for action recognition is the bag-of-words model, which is highly efficient but ignores the arrangement of local features. Refined methods compensate for this by using a large number of co-occurrence descriptors or a concatenation of the local distributions in designed segments. In contrast, this article proposes to encode the relative position of visual words using a simple but very compact method called sliding coordinate coding (SCC). The SCC vector of each kind of word is only an eight-dimensional vector, which is more compact than many of the spatial or spatial–temporal pooling methods in the literature. Our key observation is that relative position is robust to variations in video scale and view angle. Additionally, we design a temporal cutting scheme to define the margin of coding within video clips, since visual words far from each other have little relationship. In experiments, four action datasets, including KTH, Rochester Activities, IXMAS, and UCF YouTube, are used for performance evaluation. Results show that our method achieves comparable or better performance than the state of the art, while using more compact and less complex models.
Published version
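The abstract does not spell out how the eight dimensions arise. One plausible reading, shown below purely as an assumption rather than the paper's formulation, is to accumulate the relative displacements between occurrences of a visual word into eight directional bins.

    import numpy as np

    def scc_vector(points, n_bins=8):
        """Accumulate pairwise relative displacements of one visual word's
        occurrences into directional bins (an assumed binning scheme)."""
        vec = np.zeros(n_bins)
        for i in range(len(points)):
            for j in range(len(points)):
                if i == j:
                    continue
                dx, dy = points[j] - points[i]
                angle = np.arctan2(dy, dx) % (2 * np.pi)
                vec[min(int(angle / (2 * np.pi / n_bins)), n_bins - 1)] += 1
        return vec / max(vec.sum(), 1)     # normalize to a distribution

    word_positions = np.array([[0, 0], [2, 1], [5, 5], [1, 4]])  # (x, y) of one word
    print(scc_vector(word_positions))      # compact 8-D descriptor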
A Probabilistic Method of Bearing-only Localization by Using Omnidirectional Vision Signal Processing
Self-localization is a necessary foundation of a mobile robot's navigation system. Omnidirectional vision systems have been widely used on mobile robots; however, the accuracy of vision-based methods is not sufficient for many applications. In this paper, based on bearing-only localization, a probabilistic method is presented that accurately estimates the mobile robot's position using omnidirectional vision signal processing. First, landmarks and bearings are detected from omnidirectional vision signals. Then, a probabilistic map is constructed in which each coordinate carries a weight indicating the likelihood that the robot is located at that point. Finally, Monte Carlo Localization is used to merge this probabilistic map with an odometry-based prediction model. Experiments in several scenarios show that our method using omnidirectional vision signal processing is robust and accurate for mobile robot localization. © 2012 IEEE.
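The pipeline combines a bearing-only measurement model with Monte Carlo Localization. A minimal particle-filter sketch of that combination follows; the landmark positions, noise levels, and resampling scheme are illustrative placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    landmarks = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])   # known positions
    particles = rng.uniform(-1, 6, size=(500, 3))                # (x, y, heading)

    def predict(p, d, dtheta, noise=(0.05, 0.02)):
        """Odometry-based prediction: move every particle, with noise."""
        p = p.copy()
        p[:, 2] += dtheta + rng.normal(0, noise[1], len(p))
        step = d + rng.normal(0, noise[0], len(p))
        p[:, 0] += step * np.cos(p[:, 2])
        p[:, 1] += step * np.sin(p[:, 2])
        return p

    def update(p, bearings, sigma=0.2):
        """Bearing-only update: weight by agreement with observed bearings."""
        w = np.ones(len(p))
        for lm, z in zip(landmarks, bearings):
            pred = np.arctan2(lm[1] - p[:, 1], lm[0] - p[:, 0]) - p[:, 2]
            err = (pred - z + np.pi) % (2 * np.pi) - np.pi   # wrap to [-pi, pi)
            w *= np.exp(-0.5 * (err / sigma) ** 2)
        w = (w + 1e-12) / (w + 1e-12).sum()
        return p[rng.choice(len(p), len(p), p=w)]            # resample

    particles = predict(particles, d=0.5, dtheta=0.1)
    particles = update(particles, bearings=[2.5, -0.3, 1.2])
    print(particles[:, :2].mean(axis=0))                     # position estimate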
An Adaptive Method Based on Multiscale Dilated Convolutional Network for Binaural Speech Source Localization
Most binaural speech source localization models perform poorly in unseen noisy and reverberant situations. Here, this issue is approached with a multiscale dilated convolutional neural network (CNN). The time-related cross-correlation function (CCF) and the energy-related interaural level differences (ILD) are preprocessed in separate branches of the dilated convolutional network. The multiscale dilated CNN encodes discriminative representations for the CCF and ILD, respectively. After encoding, the individual interaural representations are fused to map the source direction. Furthermore, to improve parameter adaptation, a novel semiadaptive entropy is proposed to train the network under directional constraints. Experimental results show that the proposed method can adaptively locate speech sources in simulated noisy and reverberant environments.
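The two-branch design can be sketched as follows: separate dilated-convolution stacks encode the CCF and ILD inputs, and the fused representation predicts a direction class. The input lengths, channel widths, and output grid size below are assumptions, not the paper's configuration.

    import torch
    import torch.nn as nn

    class TwoBranchDilatedNet(nn.Module):
        """Separate dilated-conv stacks encode CCF and ILD; the fused
        representation is mapped to direction logits."""
        def __init__(self, n_directions=37):
            super().__init__()
            def branch():
                return nn.Sequential(      # growing dilation = multiple scales
                    nn.Conv1d(1, 16, 3, padding=1, dilation=1), nn.ReLU(),
                    nn.Conv1d(16, 16, 3, padding=2, dilation=2), nn.ReLU(),
                    nn.Conv1d(16, 16, 3, padding=4, dilation=4), nn.ReLU(),
                    nn.AdaptiveAvgPool1d(1))
            self.ccf_branch = branch()
            self.ild_branch = branch()
            self.head = nn.Linear(32, n_directions)

        def forward(self, ccf, ild):                     # each: (B, 1, T)
            z = torch.cat([self.ccf_branch(ccf).squeeze(-1),
                           self.ild_branch(ild).squeeze(-1)], dim=-1)
            return self.head(z)                          # direction logits

    net = TwoBranchDilatedNet()
    print(net(torch.randn(4, 1, 64), torch.randn(4, 1, 32)).shape)  # (4, 37)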
Combining adaptive hierarchical depth motion maps with skeletal joints for human action recognition
This paper presents a new framework for human action recognition that fuses human motion with skeletal joints. First, adaptive hierarchical depth motion maps (AH-DMMs) are proposed to capture the shape and motion cues of action sequences. Specifically, AH-DMMs are calculated over adaptive hierarchical windows, and Gabor filters are used to encode their texture information. Then, spatial distances between skeletal joint positions are computed to characterize the structure of the human body. Finally, two types of fusion, feature-level and decision-level, are employed to combine the motion cues and structure information. Experimental results on public benchmark datasets, i.e., MSRAction3D and UTKinect-Action, show the effectiveness of the proposed method.
Published version
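The underlying depth motion map idea, before the adaptive hierarchical windowing and Gabor encoding are layered on top, is to accumulate thresholded frame-to-frame depth differences. A minimal sketch, with a made-up depth sequence and noise threshold:

    import numpy as np

    def depth_motion_map(depth_seq, eps=10):
        """Accumulate thresholded frame-to-frame depth differences over a
        window, capturing both the shape and the motion of the action."""
        dmm = np.zeros(depth_seq[0].shape)
        for prev, curr in zip(depth_seq[:-1], depth_seq[1:]):
            diff = np.abs(curr.astype(float) - prev.astype(float))
            dmm += np.where(diff > eps, diff, 0)   # suppress sensor noise
        return dmm

    seq = [np.random.randint(0, 4000, (240, 320)) for _ in range(8)]  # fake depth
    print(depth_motion_map(seq).shape)             # (240, 320)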